MT-MAG: Accurate and interpretable machine learning for complete or partial taxonomic assignments of metagenomeassembled genomes

We propose MT-MAG, a novel machine learning-based software tool for the complete or partial hierarchically-structured taxonomic classification of metagenome-assembled genomes (MAGs). MT-MAG is alignment-free, with k-mer frequencies being the only feature used to distinguish a DNA sequence from another (herein k = 7). MT-MAG is capable of classifying large and diverse metagenomic datasets: a total of 245.68 Gbp in the training sets, and 9.6 Gbp in the test sets analyzed in this study. In addition to complete classifications, MT-MAG offers a “partial classification” option, whereby a classification at a higher taxonomic level is provided for MAGs that cannot be classified to the Species level. MT-MAG outputs complete or partial classification paths, and interpretable numerical classification confidences of its classifications, at all taxonomic ranks. To assess the performance of MT-MAG, we define a “weighted classification accuracy,” with a weighting scheme reflecting the fact that partial classifications at different ranks are not equally informative. For the two benchmarking datasets analyzed (genomes from human gut microbiome species, and bacterial and archaeal genomes assembled from cow rumen metagenomic sequences), MT-MAG achieves an average of 87.32% in weighted classification accuracy. At the Species level, MT-MAG outperforms DeepMicrobes, the only other comparable software tool, by an average of 34.79% in weighted classification accuracy. In addition, MT-MAG is able to completely classify an average of 67.70% of the sequences at the Species level, compared with DeepMicrobes which only classifies 47.45%. Moreover, MT-MAG provides additional information for sequences that it could not classify at the Species level, resulting in the partial or complete classification of 95.13%, of the genomes in the datasets analyzed. Lastly, unlike other taxonomic assignment tools (e.g., GDTB-Tk), MT-MAG is an alignment-free and genetic marker-free tool, able to provide additional bioinformatics analysis to confirm existing or tentative taxonomic assignments.

• Figure S2 -Task 1 (sparse) test set: high-quality MAGs reconstructed from human gut microbiomes from a European Nucleotide Archive study titled "A new genomic blueprint of the human gut microbiota" (GBHGM) [16].
Fig S1. Genome size, contig count and percent GC distributions for all genomes in the HGR database. Recall that HGR comprises 1,952 MAGs and 553 microbial gut Species-level genome representatives from the human-specific reference database. From the histograms, we observe that the genome sizes are centered around 2 Mbp; the contig count is right-skewed and peaks at around 38; the percent GC peaks at around 59%.

Fig S2.
Genome size, contig count and percent GC distributions for all GBHGM MAGs. Recall that GBHGM comprises 3,269 high-quality MAGs reconstructed from human gut microbiomes from a European Nucleotide Archive study titled "A new genomic blueprint of the human gut microbiota" (GBHGM) [16]. From the histograms we observe that the genome sizes are centered around 2 Mbp; the contig count is right-skewed and peaks at around 38; the percent GC peaks at around 44%. Fig S3. Genome size, contig count and percent GC distributions for all genomes (top panels) and Species-level representative genomes in GTDB (bottom panels). Recall that the full GTDB comprises 311,480 bacteria genomes and 6,062 archaeal genomes. From the histograms in the top panels, generated for all genomes in GTDB, we observe that the genome sizes are multimodal and peak at around 2 Mbp, 3 Mbp and 5 Mbp; the contig count is right-skewed and peaks at around 13; the percent GC peaks at around 51%. From the histograms in the bottom panels, generated for Species level representatives in GTDB, the genome sizes are right-skewed and peak at around 2 Mbp; the contig count is right-skewed and peaks at around 13; the percent GC peaks at around 41% and 64%. Fig S4. Genome size, contig count and percent GC distributions for all 913 cow rumen MAGs. Recall that there are 913 "draft" bacterial and archaeal genomes assembled from rumen metagenomic sequence data derived from 43 Scottish cattle. From the histograms, we observe that the genome sizes are centered around 2 Mbp; the contig count is right-skewed and peaks at around 63; the percent GC peaks at around 50%.

Seed fixing
To ensure our results are replicable, our presented results were produced by setting the random seed to 0 for all processes that involve randomness (unless otherwise mentioned).

Special cases
• Dataset being too large for Task 1 (sparse). For the Phylum level classification, as well as for the over-represented lineages from Class to Genus level classifications, we randomly selected 10% of contigs from any child taxon with more than 100 contigs. This reduced sampling bias in the datasets and reduced computational complexity without omitting any taxon within the taxonomy.
• Dataset being too small for Task 2 (dense). One type of special case concerns the taxa with insufficient number of representative genomic fragments. More specifically, to perform five-fold cross-validation, each child taxon needs to have at least five representative genomic fragments. To address this issue, in such cases, and for the Phylum to Genus level taxa, the child taxa with fewer than five representative genomic fragments were removed.
• Dataset being too large for Task 2 (dense). For Domain, Phylum, and over-represented lineages from Class to Genus, where eMLDSP consumes an extreme large amount of random access memory, we randomly selected 10% of the representative genomes from any child taxon with more than 100 representative genomes. For less-well represented lineages from Classes to Genera, we selected all representative genomes. For Species, we selected all representative and non-representative genomes.
• Imbalanced classification for Task 1 (sparse) and Task 2 (dense). One scenario which could lead to problematic classifications is that of imbalanced taxon sizes. This is because significant differences in child taxon sizes may violate the assumption, necessary for most classification algorithms, that the number of the training instances for each class be roughly the same. Imbalanced class sizes may pose a challenge for predictive performance, especially for the classes with few training instances [40]. In general, there is a trade-off between balanced class sizes and the amount of variability reflected in each class. In the computational experiments in this paper, the situation of imbalanced taxon sizes was dealt with differently, depending on the taxonomic rank. For high-level classifications (i.e., Domain to Phylum, Phylum to Class, Class to Order, Order to Family), no pruning of over-sampled child taxa was performed. This is because the number of (representative) genomes in a child taxon is proportional to the amount of variability in the taxon and, for high-level classifications, the differences in the amounts of variability of different child taxa can be significant. Since the training instances are intended to capture and represent the differences in the amount of variability, the oversized child taxa must be preserved as being reflective of their respective amounts of variability, and were not pruned.
The situation is different for low-level classifications (i.e., Family to Genus, and Genus to Species), where the differences in the amounts of variability of various child taxa is much smaller, and imbalanced taxon sizes can have a negative impact on the training model and its performance. In these latter cases, all training child taxa were pruned to relatively similar sizes as follows. After sampling, the number of contigs/representative genomic fragments in each child taxon was counted. If the number of contigs/representative genomic fragments was greater than 30, then 30 contigs/representative genomic fragments were randomly selected and used for training.

DeepMicrobes training
For Task 2 (dense), via a hyper-parameter search, we found the attention model with the following hyper-parameter values yielded the best performance: the embedding dimension being 50, the batch size being 1024, the learning rate starting with 0.001. 30,000 150 basepair (bp) paired-end reads were simulated from HiSeq 2500 error model in ART Illumina read simulator [25], with the mean fragment size of 200, standard deviation of 50 bp per species and seeds 1, 2 and 11.